You are looking at an interactive notebook that comprises R code and text. It can be used to document your analyses and produce reports. An R code chunk looks like this :-
Any R code can be executed be placing your cursor on the same and pressing CTRL and ENTER, or clicking the green triangle on the right of the pane.
Try this now
print("Hello World")
[1] "Hello World"
🎉
R has a rich heritage of packages for manipulating and visualising data and performing statistical analysis. One particular set of packages, the tidyverse has gained a lot of attention recently as it supports a lot of common data manipulation operations.
This cheatsheet summarises the main data manipulation functions of the package and the documentation is excellent.
The tidyverse functionality is not automatically built into R. We have to load the tidyverse library every time we start R.
library(tidyverse)
We will illustrate some of the main functions from tidyverse with example data from COSMIC. An example file is freely available, but you will need to register for full-size file. This is an archive of files. Each has 100 lines of the the full files (that you have to register to access)
download.file is an R function for downloading a file into your working directory (the directory where R is currently looking to open files from, and save files to).
download.file("https://cog.sanger.ac.uk/cosmic/taster/example_grch38_cosmic.tar",destfile = "example_grch38_cosmic.tar")
We use R function, untar to extract files from the archive. You should see that a cosmic_example directory has been created.
untar(tarfile = "example_grch38_cosmic.tar",exdir="cosmic_example")
We import data so that we can analyse it in R using one of the various functions that are called read_... The choice depends on the type of file (csv, tsv, etc…)
mutants <- read_tsv("cosmic_example/CosmicMutantExport.tsv")
Parsed with column specification:
cols(
.default = col_character(),
`Gene CDS length` = [32mcol_double()[39m,
`HGNC ID` = [32mcol_double()[39m,
ID_sample = [32mcol_double()[39m,
ID_tumour = [32mcol_double()[39m,
GRCh = [32mcol_double()[39m,
`FATHMM score` = [32mcol_double()[39m,
Pubmed_PMID = [32mcol_double()[39m,
ID_STUDY = [32mcol_double()[39m,
Age = [32mcol_double()[39m
)
See spec(...) for full column specifications.
The <- symbol means that we are storing the data in R’s memory for future use. Whenever we want to see what the data look like, or manipulate them, we can use the name of the “variable” that we have specified; mutants.
This prints the data to the screen, but in a clever way so that we can cycle through the entries.
mutants
mutants is an example of a data frame, which is R’s way of storing tabular data where we potentially have different types of data (text, numbers, dates) in each column.
Selecting columnsWe can use the select function to print only the columns we are interested in by listing their names with a , in between. Column names with a space need to have single quote marks around them ```
select(mutants, `Gene name`, `Primary site`, `Mutation ID`)
If the columns we want are consecutive there is a handy shortcut of :
select(mutants, `Gene name`:`Sample name`)
Or if we known that the columns all have a particular name in common.
select(mutants, contains("Mutation"))
filtering the rowsThe select function prints all the rows to the screen. There are many ocassions when we might want to print rows that meet a certain criteria.
e.g. to find all mutations that are pathogenic.
The == symbol is used to test what entries in the FATHMM prediction column are the same as PATHOGENIC
filter(mutants, `FATHMM prediction`=="PATHOGENIC")
Similar example but looking for a given gene
filter(mutants, `Gene name`=="CMYA5")
filter can also be used when we have numeric data
e.g. what mutations occur in individuals under 50
filter(mutants, Age<50)
The mutate function can be used to add new columns based on existing ones. An extremely useful function is separate that can split columns that appear to contain more than one piece of information.
e.g. extract chromosome, start and end from the genome location
mutants <- separate(mutants, `Mutation genome position`, into= c("Chromosome","Start","End"))
Data can be sorted according to column with the arrange function; either ascending of descending
arrange(mutants, `Gene CDS length`)
arrange(mutants, desc(`Gene CDS length`))
Sorting also working on character columns
arrange(mutants, `Primary site`)
The real power of tidyverse lies in the fact that we can chain operations together. Consider that we want an output table of all mutants in lung that are pathogenic, ordered by genomic location, displaying only the location, gene name and mutation ID. Using the R code we have written so far we could do this as:-
mutants2 <- filter(mutants, `Primary site` == "lung",`FATHMM prediction` == "PATHOGENIC")
mutants3 <- arrange(mutants2,Chromosome,Start)
mutants4 <- select(mutants3,Chromosome,Start,`Gene name`, `Mutation ID`, `Mutation CDS`)
mutants4
The %>% operation allows us to write one line of R code that follows on from the previous. This results in “cleaner” code that is easier to read and doesn’t require us to create lots of variables that are only used once. In this case, our dataset is quite small. For larger datasets we wouldn’t really want to create too more copies than necessary.
filter(mutants, `Primary site` == "lung",`FATHMM prediction` == "PATHOGENIC") %>%
arrange(Chromosome, Start) %>%
select(Chromosome,Start,`Gene name`, `Mutation ID`, `Mutation CDS`)
filter(mutants, `Primary site` == "lung",`FATHMM prediction` == "PATHOGENIC") %>%
mutate(Chromosome = as.numeric(Chromosome)) %>%
arrange(Chromosome, Start) %>%
select(Chromosome,Start,`Gene name`, `Mutation ID`, `Mutation CDS`)
The tidyverse also provides visualisation via the ggplot function. The plotting is extremely flexible and can create publication-ready graphics. There is a useful cheatsheet
The gg in ggplot stands for “grammar of graphics” and allows us to “map” between variables in our dataset (such as the mutants data we have already been working with).
e.g. creating a barplot to show the frequency of various mutation types
ggplot(mutants, aes(x=`FATHMM prediction`)) + geom_bar()
mutants is the data that we want to plot; FATHMM prediction is the variable we want to show on the x-axis and geom_bar is the way that we want to display the data. We will re-visit other plotting types when we have imported a larger version of this dataset.
e.g. the distribution of gene lengths
ggplot(mutants, aes(x=`Gene CDS length`)) + geom_histogram()
Accessing the full dataset requires registration with COSMIC. Assuming you have completed that process you should be able to download a file CosmicMutantExportConsensus.tsv.gz containing all mutants in the Cancer Gene Consensus. This can be found under the heading All Mutations in Census Genes. It is around 40MB so takes a little bit of time to read into R.
If you are running this course in-person through a web browser, select the Upload option above the file explorer in RStudio (bottom-right) and select the location of your COSMIC download. The name of the file should appear in the file explorer window
(even though the file is compressed (.gz) we can still read it into R)
We use the same read_tsv command with the path to the new file.
consensus_mutants <- read_tsv("CosmicMutantExportCensus.tsv.gz")
Parsed with column specification:
cols(
.default = col_character(),
`Gene CDS length` = [32mcol_double()[39m,
ID_sample = [32mcol_double()[39m,
ID_tumour = [32mcol_double()[39m,
GRCh = [32mcol_double()[39m,
`FATHMM score` = [32mcol_double()[39m,
Pubmed_PMID = [32mcol_double()[39m,
ID_STUDY = [32mcol_double()[39m,
Age = [32mcol_double()[39m,
Tier = [32mcol_double()[39m
)
See spec(...) for full column specifications.
The data frame object now has 700K rows
consensus_mutants
The same select, filter etc operations apply.
e.g. What mutations are recorded for TP53?
filter(consensus_mutants, `Gene name`=="TP53")
e.g. What is the most commonly mutated gene in breast cancer?
Here we can use a handy function count to count the number of time unique entries in a particular column occur
filter(consensus_mutants, `Primary site`=="breast") %>%
count(`Gene name`) %>%
arrange(desc(n))
The %>% operation allows us to plot that data frames that we have manipulated with filter etc.
e.g. gWhat cancer types is TP53 most commonly mutated in?
filter(consensus_mutants, `Gene name`=="TP53") %>%
ggplot(aes(x=`Primary site`)) + geom_bar()
There are a few too many sites to see the plot properly, so we can further filter and also flip the axis so it is easier to read
filter(consensus_mutants, `Gene name`=="TP53") %>%
count(`Primary site`) %>%
filter(n > 1000) %>%
ggplot(aes(x=`Primary site`,y=n)) + geom_bar(stat="identity") + coord_flip()
The ordering of the bars is alphabetical by default. But it is more natural to reorder according to the counts.
filter(consensus_mutants, `Gene name`=="TP53") %>%
count(`Primary site`) %>%
filter(n > 1000) %>%
ggplot(aes(x=reorder(`Primary site`,n),y=n)) + geom_bar(stat="identity") + coord_flip()
To add colour to the plot, we can add an additional option to geom_bar, which would use the same colour for all the bars
filter(consensus_mutants, `Gene name`=="TP53") %>%
count(`Primary site`) %>%
filter(n > 1000) %>%
ggplot(aes(x=reorder(`Primary site`,n),y=n)) + geom_bar(stat="identity",fill="steelblue") + coord_flip()
Colours can also be mapped to columns in the data frame; in the same way that the x- and y- variables can
filter(consensus_mutants, `Gene name`=="TP53") %>%
count(`Primary site`) %>%
filter(n > 1000) %>%
ggplot(aes(x=reorder(`Primary site`,n),y=n,fill=`Primary site`)) + geom_bar(stat="identity") + coord_flip()
Different geom_s are possible depending on what kind of data you have.
Do TP53 mutations have a different age distribution in different cancer types?
filter(consensus_mutants, `Gene name`=="TP53") %>%
ggplot(aes(x=`Primary site`,y=Age)) + geom_boxplot() + coord_flip()
We will use the somatic calls that have previously been made for the HCC1143 breast cancer cell line. The calls were made as part of a Cancer Research Uk Bioinformatics Summer School.
The following workflow including tools from the Bioconductor project will be used
VariantAnnotationTxDb.... to annotate variants within coding regionsBsgenome.... to assess functionorg.Hs.eg.db to convert to between identifiersdplyr and use the functions from the previous sectionFirst we need to download the file and save locally.
url <- "https://github.com/bioinformatics-core-shared-training/cruk-summer-school-2016/raw/master/Day3/HCC1143_vs_HCC1143_BL.flagged.muts.vcf"
download.file(url, destfile="HCC1143_vs_HCC1143_BL.flagged.muts.vcf")
Although we could use the read_tsv function that we saw previously, the vcf file is complex. The VariantAnnotation package can be used to import the data into R.
Unlike a data frame, the whole object does not get printed to the screen we type raw_vcf and execute.
library(VariantAnnotation)
raw_vcf <- readVcf("HCC1143_vs_HCC1143_BL.flagged.muts.vcf")
raw_vcf
class: CollapsedVCF
dim: 92286 2
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 12 columns: DP, MP, GP, TG, TP, SG, SP, DS, SNP, ASRD, CLPM, ASMD
info(header(vcf)):
geno(vcf):
SimpleList of length 10: GT, FAZ, FCZ, FGZ, FTZ, RAZ, RCZ, RGZ, RTZ, PM
geno(header(vcf)):
To make the process go more smoothly, we will consider variants on chromosomes 1 to 22 only. The naming convention of the chromosomes also needs to be changed to be compatible with the annotation resources we are going to use.
raw_vcf <- keepSeqlevels(raw_vcf, 1:22,pruning.mode = "coarse")
seqlevelsStyle(raw_vcf) <- "UCSC"
genome(raw_vcf) <- "hg19"
The data from the vcf file has been separated into different parts of the object that can be accessed by a series of functions. The location of the variants can be returned using the rowRanges function that is part of VariantAnnotation.
var_locs <- rowRanges(raw_vcf)
var_locs
GRanges object with 87753 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
6c54cf8c-3ee9-11e6-9d63-a89a8835fc03 chr1 16103 * | <NA>
6c54d310-3ee9-11e6-9d63-a89a8835fc03 chr1 17222 * | <NA>
6c54d3ce-3ee9-11e6-9d63-a89a8835fc03 chr1 74570 * | <NA>
6c54d46e-3ee9-11e6-9d63-a89a8835fc03 chr1 101445 * | <NA>
6c54d504-3ee9-11e6-9d63-a89a8835fc03 chr1 101456 * | <NA>
... ... ... ... . ...
6dc4437a-3ee9-11e6-9d63-a89a8835fc03 chr22 51192990 * | <NA>
6dc44406-3ee9-11e6-9d63-a89a8835fc03 chr22 51238839 * | <NA>
6dc44492-3ee9-11e6-9d63-a89a8835fc03 chr22 51238846 * | <NA>
6dc44528-3ee9-11e6-9d63-a89a8835fc03 chr22 51239025 * | <NA>
6dc445b4-3ee9-11e6-9d63-a89a8835fc03 chr22 51239040 * | <NA>
REF ALT QUAL
<DNAStringSet> <DNAStringSetList> <numeric>
6c54cf8c-3ee9-11e6-9d63-a89a8835fc03 T G <NA>
6c54d310-3ee9-11e6-9d63-a89a8835fc03 A G <NA>
6c54d3ce-3ee9-11e6-9d63-a89a8835fc03 T A <NA>
6c54d46e-3ee9-11e6-9d63-a89a8835fc03 G C <NA>
6c54d504-3ee9-11e6-9d63-a89a8835fc03 A G <NA>
... ... ... ...
6dc4437a-3ee9-11e6-9d63-a89a8835fc03 C A <NA>
6dc44406-3ee9-11e6-9d63-a89a8835fc03 C T <NA>
6dc44492-3ee9-11e6-9d63-a89a8835fc03 G A <NA>
6dc44528-3ee9-11e6-9d63-a89a8835fc03 C T <NA>
6dc445b4-3ee9-11e6-9d63-a89a8835fc03 G A <NA>
FILTER
<character>
6c54cf8c-3ee9-11e6-9d63-a89a8835fc03 VUM;MQ;MN
6c54d310-3ee9-11e6-9d63-a89a8835fc03 MNP;VUM;MQ;MN
6c54d3ce-3ee9-11e6-9d63-a89a8835fc03 MQ;VUM
6c54d46e-3ee9-11e6-9d63-a89a8835fc03 MQ;MNP
6c54d504-3ee9-11e6-9d63-a89a8835fc03 MNP;MQ
... ...
6dc4437a-3ee9-11e6-9d63-a89a8835fc03 MNP
6dc44406-3ee9-11e6-9d63-a89a8835fc03 MQ
6dc44492-3ee9-11e6-9d63-a89a8835fc03 RP;MQ
6dc44528-3ee9-11e6-9d63-a89a8835fc03 MQ;MNP;MN
6dc445b4-3ee9-11e6-9d63-a89a8835fc03 MN;MQ
-------
seqinfo: 22 sequences from hg19 genome
We notice that the vcf file already has some filters applied for quality. Variants that pass all the filters have an entry of PASS. We will use these variants in subsequent analyses.
N.B. the data aren’t yet in a form that we can use the dplyr operations from previously, so we have to use a different approach to filter the rows.
pass_vcf <- raw_vcf[var_locs$FILTER == "PASS"]
pass_locs <- rowRanges(pass_vcf)
The Bioconductor project also distributes several annotation packages that can be used to annotate our results. The collection of TxDb.... packages comprise definitions of transcripts for a given genome build.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
The locateVariants function can be used to find where the variants are located with respect to different types of genomic feature. Various types of match including CodingVariants that will find variants inside coding regions. The output is converted to a data frame so we can use the dplyr functions we saw earlier
We clean-up the output so only certain columns are displayed. The QUERYID column needs to be kept to refer back to the complete set of variants in the filtered vcf file. The GENEID column will be used later to map to other annotation resources.
var_annotations <- locateVariants(pass_locs, txdb, CodingVariants())
GRanges object contains 653 out-of-bound ranges located on sequences 7, 8, 220,
4318, 4319, 4320, 4321, 4323, 4708, 5800, 6512, 2803, 3144, 7538, 7539, 7930,
7931, 11490, 8897, 9119, 9249, 12212, 12213, 12485, 12590, 12833, 10288, 13104,
14225, 14387, 22341, 20690, 20691, 23089, 21780, 21788, 21789, 21790, 21791,
27655, 30453, 29152, 29153, 29159, 31222, 29398, 33491, 33492, 33493, 34133,
34135, 34154, 32887, 35157, 35796, 37617, 39458, 39459, 39460, 39461, 39544,
39545, 39546, 39547, 39548, 39549, 39794, 39795, 40302, 40716, 40717, 40718,
40719, 40721, 40722, 40723, 40724, 40726, 40727, 44317, 42208, 42209, 44941,
44942, 45684, 45685, 48678, 48720, 48721, 49165, 49404, 50849, 52689, 51422,
51423, 51590, 53239, 53246, 53247, 53248, 53497, 53498, 53540, 53541, 53542,
53543, 53544, 53865, 53866, 55771, 55772, 56334, 56337, 56698, 56701, 56702,
56703, 56704, 56705, 57211, 58018, 58020, 64269, 64270, 64271, 64272, 64273,
64274, 65311, 65112, 65813, 65903, 66184, 69031, 69032, 66684, 66687, 67037,
67691, 67692, 67693, 67694, 70538, 70539, 73793, 73797, 73798, 73799, and 74949.
Note that ranges located on a sequence whose length is unknown (NA) or on a
circular sequence are not considered out-of-bound (use seqlengths() and
isCircular() to get the lengths and circularity flags of the underlying
sequences). You can use trim() to trim these ranges. See
?`trim,GenomicRanges-method` for more information.
var_annotations <- as.data.frame(var_annotations) %>%
dplyr::select(QUERYID,GENEID) %>% unique %>%
filter(!is.na(GENEID))
var_annotations
With the predictCoding function we can assess whether the mutations are going to result in amino acid change or not. This makes use of the Genome sequence, which is available in the BSgenome.Hsapiens.UCSC.hg19 package.
Again we only use certain columns from the output.
library(BSgenome.Hsapiens.UCSC.hg19)
var_predictions <- predictCoding(pass_vcf, txdb, seqSource = Hsapiens)
GRanges object contains 653 out-of-bound ranges located on sequences 7, 8, 220,
4318, 4319, 4320, 4321, 4323, 4708, 5800, 6512, 2803, 3144, 7538, 7539, 7930,
7931, 11490, 8897, 9119, 9249, 12212, 12213, 12485, 12590, 12833, 10288, 13104,
14225, 14387, 22341, 20690, 20691, 23089, 21780, 21788, 21789, 21790, 21791,
27655, 30453, 29152, 29153, 29159, 31222, 29398, 33491, 33492, 33493, 34133,
34135, 34154, 32887, 35157, 35796, 37617, 39458, 39459, 39460, 39461, 39544,
39545, 39546, 39547, 39548, 39549, 39794, 39795, 40302, 40716, 40717, 40718,
40719, 40721, 40722, 40723, 40724, 40726, 40727, 44317, 42208, 42209, 44941,
44942, 45684, 45685, 48678, 48720, 48721, 49165, 49404, 50849, 52689, 51422,
51423, 51590, 53239, 53246, 53247, 53248, 53497, 53498, 53540, 53541, 53542,
53543, 53544, 53865, 53866, 55771, 55772, 56334, 56337, 56698, 56701, 56702,
56703, 56704, 56705, 57211, 58018, 58020, 64269, 64270, 64271, 64272, 64273,
64274, 65311, 65112, 65813, 65903, 66184, 69031, 69032, 66684, 66687, 67037,
67691, 67692, 67693, 67694, 70538, 70539, 73793, 73797, 73798, 73799, and 74949.
Note that ranges located on a sequence whose length is unknown (NA) or on a
circular sequence are not considered out-of-bound (use seqlengths() and
isCircular() to get the lengths and circularity flags of the underlying
sequences). You can use trim() to trim these ranges. See
?`trim,GenomicRanges-method` for more information.
names(var_predictions) <- make.names(names(var_predictions),unique = TRUE)
var_predictions <- as.data.frame(var_predictions) %>%
dplyr::select(QUERYID,CONSEQUENCE,REFCODON,VARCODON,REFAA,VARAA) %>% unique
var_predictions
Now the data are in data frames, we can join the two tables together with dplyr.
var_annotations <- left_join(var_annotations,var_predictions,by="QUERYID")
var_annotations
The gene names are currently Entrez ID, which although useful for mapping, do not give particularly memorable names that we can search for. We use a pre-built annotation package org.Hs.eg.db to obtain gene symbols.
library(org.Hs.eg.db)
gene_anno <- AnnotationDbi::select(org.Hs.eg.db,
columns="SYMBOL",
keys=var_annotations$GENEID,
keytype = "ENTREZID")
'select()' returned many:1 mapping between keys and columns
gene_anno <- dplyr::rename(gene_anno, "GENEID"=ENTREZID)
var_annotations <- left_join(var_annotations, gene_anno) %>% unique
Joining, by = "GENEID"
var_annotations
We can put everything together into one table. The locations from the vcf need to be converted into a data frame including a QUERYID column that can be matched to the annotations. This is defined to be the row number.
final_tab <- as.data.frame(pass_locs) %>%
mutate(QUERYID = 1:n()) %>% dplyr::select(-width,-paramRangeID,-ALT,-QUAL,-FILTER,-strand) %>%
inner_join(var_annotations, by="QUERYID") %>%
dplyr::select(seqnames:end,SYMBOL,REF:VARAA,-QUERYID)
final_tab
The dplyr functions and ggplot2 introduced in the previous section can now be applied to our list of variants.
filter(final_tab, CONSEQUENCE=="nonsense")
What genes are most commonly mutated?
dplyr::count(final_tab, SYMBOL) %>% arrange(desc(n))
How many variants are located on each chromosome?
ggplot(final_tab, aes(x=seqnames,fill=CONSEQUENCE)) + geom_bar(position="dodge") + coord_flip()
For convenience, we chose to run this workshop through a web-browser on the 22nd January. After the workshop, you will need to install both R and RStudio on your own machine and install some R packages.
Install R by downloading and running this .exe file from CRAN. Also, please download and run the RStudio installer for Windows.
Then open RStudio and paste the following lines of R code into a console and press Enter
install.packages("BiocManager")
BiocManager::install("tidyverse")
BiocManager::install("VariantAnnotation", version = "3.8")
BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene", version = "3.8")
BiocManager::install("org.Hs.eg.db", version = "3.8")
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19", version = "3.8")
Install R by downloading and running this .pkg file from CRAN. Also, please download and run the RStudio installer for Mac
Then open RStudio and paste the following lines of R code into a console and press Enter
install.packages("BiocManager")
BiocManager::install("tidyverse")
BiocManager::install("VariantAnnotation", version = "3.8")
BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene", version = "3.8")
BiocManager::install("org.Hs.eg.db", version = "3.8")
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19", version = "3.8")